The rapid global trend toward urbanization poses significant challenges for ensuring that rural populations are not left behind in development and policy considerations. Tracking vital statistics for these populations is increasingly difficult, leading to a shortage of reliable, up-to-date data. Private data sources, such as social media data, have shown potential to complement or supplement existing public data. This study explores using Facebook advertising data, which offers a digital “census” of over two billion users, to measure and analyze rural-urban inequalities.
This vignette uses rSocialWatcher (Marty, 2020) to query Facebook Marketing data, demonstrating the package's potential for social science research. The Facebook data are used to test for statistically significant differences in selected behaviors between urban and rural regions of Connecticut, United States. This vignette is based on “Facebook Ads as a Demographic Tool to Measure the Urban-Rural Divide” (Rama et al., 2020), which similarly uses Facebook Marketing data to study the urban-rural divide in Italy.
Below we load relevant packages and define Facebook keys.
## Load packages
library(rsocialwatcher)
library(dplyr)
library(tidyr)
library(ggplot2)
library(WDI)
library(janitor)
library(leaflet)
library(ggpubr)
library(knitr)
library(kableExtra)
library(sf)
library(spData)
TOKEN <- "Token"
VERSION <- "Version"
CREATION_ACT <- "Creation Act"
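Hardcoding tokens risks leaking them when a vignette is shared; a minimal sketch of reading credentials from environment variables (e.g., set in `~/.Renviron`) instead. The variable names `FB_TOKEN`, `FB_VERSION`, and `FB_CREATION_ACT` are illustrative assumptions, not names the package requires.

```r
# Hypothetical helper: read a Marketing API credential from an
# environment variable, falling back to a default when unset.
get_fb_credential <- function(var, default = NA_character_) {
  val <- Sys.getenv(var, unset = "")
  if (val == "") default else val
}

TOKEN        <- get_fb_credential("FB_TOKEN")
VERSION      <- get_fb_credential("FB_VERSION", default = "v19.0")
CREATION_ACT <- get_fb_credential("FB_CREATION_ACT")
```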
From the United States Census Bureau, we acquired 2020 American Community Survey (ACS) household income data. We then select the relevant columns:
ct_income_df <- ct_income_data %>%
select(NAME, S1903_C01_001E, S1903_C03_001E)
ct_income_df <- ct_income_df[11:nrow(ct_income_df), 1:3]
ct_income_df$ZipCode <- as.integer(sub("ZCTA5 ", "", ct_income_df$NAME))
As Connecticut zip codes all begin with “06”, we pull Facebook data for all Connecticut zip codes. We then visualize the regions by household income with Leaflet.
CT_zip_df <- get_fb_parameter_ids(type = "zip",
q = "06",
country_code = "US",
version = VERSION,
token = TOKEN)
head(CT_zip_df) %>% kable
| key | name | type | country_code | country_name | region | region_id | primary_city | primary_city_id | supports_region | supports_city |
|---|---|---|---|---|---|---|---|---|---|---|
| US:06902 | 06902 | zip | US | United States | Connecticut | 3849 | Stamford | 2425322 | TRUE | TRUE |
| US:06023 | 06023 | zip | US | United States | Connecticut | 3849 | Berlin | 2424573 | TRUE | TRUE |
| US:06068 | 06068 | zip | US | United States | Connecticut | 3849 | Salisbury | 2425242 | TRUE | TRUE |
| US:06277 | 06277 | zip | US | United States | Connecticut | 3849 | Thompson | 2425358 | TRUE | TRUE |
| US:06786 | 06786 | zip | US | United States | Connecticut | 3849 | Plymouth | 2425170 | TRUE | TRUE |
| US:06441 | 06441 | zip | US | United States | Connecticut | 3849 | Haddam | 2424844 | TRUE | TRUE |
zip_ct_sf$name <- as.integer(zip_ct_sf$name)
income_sf <- st_as_sf(ct_income_df %>%
left_join(zip_ct_sf, by = c("ZipCode" = "name")) %>%
select("ZipCode","geometry", "S1903_C03_001E"))
income_sf$S1903_C03_001E <- as.integer(gsub("[^0-9.-]", "", income_sf$S1903_C03_001E))
#> Warning: NAs introduced by coercion
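The coercion warning arises because ACS tables encode suppressed or top-coded values with non-numeric placeholders; the exact placeholder strings in this sketch are assumptions for illustration.

```r
# Illustrative ACS-style income strings: "-" (suppressed) survives the
# gsub and coerces to NA, while "250,000+" (top-coded) becomes 250000.
raw  <- c("85,413", "250,000+", "-", "11,663")
vals <- suppressWarnings(as.integer(gsub("[^0-9.-]", "", raw)))
vals
#> [1]  85413 250000     NA  11663
sum(is.na(vals))  # number of values dropped to NA
#> [1] 1
```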
pal <- colorNumeric(palette = "viridis", domain = income_sf$S1903_C03_001E)
leaflet(data = income_sf) %>%
addTiles() %>%
addPolygons(
fillColor = ~pal(as.integer(S1903_C03_001E)),
weight = 2,
color = "white",
fillOpacity = 0.7,
smoothFactor = 0.5,
popup = ~paste(ZipCode, ": $", S1903_C03_001E, sep = "")
)
which(income_sf$S1903_C03_001E < 20000)
#> [1] 214
income_sf[214,]
#> Simple feature collection with 1 feature and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -73.0573 ymin: 41.54603 xmax: -73.0205 ymax: 41.57236
#> Geodetic CRS: WGS 84
#> ZipCode S1903_C03_001E geometry
#> 214 6702 11663 MULTIPOLYGON (((-73.05539 4...
We query Facebook to get the estimated DAU (daily active users) for each zip code, then select only the zip code and DAU columns.
total_people_ct_df <- query_fb_marketing_api(
  location_unit_type = "zip",
  location_keys = map_param_vec(zip_ct_sf$key),
  version = c(VERSION, VERSION1),
  creation_act = c(CREATION_ACT, CREATION_ACT1),
  token = c(TOKEN, TOKEN1))
| X | estimate_dau | estimate_mau_lower_bound | estimate_mau_upper_bound | location_unit_type | location_types | location_keys | gender | age_min | age_max | api_call_time_utc |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 70376 | 85100 | 100100 | zips | home or recent | US:06902 | 1 or 2 | 18 | 65 | 2024-04-29 09:11:38.661257 |
| 2 | 832 | 1000 | 1000 | zips | home or recent | US:06023 | 1 or 2 | 18 | 65 | 2024-04-29 09:11:39.381142 |
| 3 | 1152 | 1400 | 1600 | zips | home or recent | US:06068 | 1 or 2 | 18 | 65 | 2024-04-29 09:11:40.306598 |
| 4 | 3377 | 4200 | 5000 | zips | home or recent | US:06277 | 1 or 2 | 18 | 65 | 2024-04-29 09:11:41.243462 |
| 5 | 5609 | 7000 | 8200 | zips | home or recent | US:06786 | 1 or 2 | 18 | 65 | 2024-04-29 09:11:41.929814 |
| 6 | 2763 | 4100 | 4900 | zips | home or recent | US:06441 | 1 or 2 | 18 | 65 | 2024-04-29 09:11:42.942949 |
total_people_ct_df$ZipCode <- sub("US:", "", total_people_ct_df$location_keys)
total_people_ct_df <- total_people_ct_df %>%
select("ZipCode","estimate_dau")
head(total_people_ct_df) %>% kable
| ZipCode | estimate_dau |
|---|---|
| 06902 | 70376 |
| 06023 | 832 |
| 06068 | 1152 |
| 06277 | 3377 |
| 06786 | 5609 |
| 06441 | 2763 |
Below we pull the behavior IDs; the two keys correspond to people using iOS devices and Android devices, respectively:
behaviors_df <- get_fb_parameter_ids("behaviors", VERSION, TOKEN)
beh_mobile_ids = c("6004384041172", "6004386044572")
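Rather than hardcoding the two IDs, they can be looked up by a partial name match against `behaviors_df`; the behavior names in this toy stand-in are illustrative, not the exact strings returned by the API.

```r
# Toy stand-in for behaviors_df as returned by get_fb_parameter_ids();
# the name strings are assumptions for illustration.
behaviors_df_toy <- data.frame(
  id   = c("6004384041172", "6004386044572", "6002714895372"),
  name = c("Facebook access (mobile): Apple (iOS) devices",
           "Facebook access (mobile): Android devices",
           "Frequent Travelers"),
  stringsAsFactors = FALSE
)

# Base-R lookup of the device IDs by partial name match
beh_mobile_ids_lookup <- behaviors_df_toy$id[grepl("iOS|Android", behaviors_df_toy$name)]
```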
fb_df <- query_fb_marketing_api(
  location_unit_type = "zip",
  location_keys = map_param_vec(zip_ct_sf$key),
  behaviors = map_param_vec(beh_mobile_ids),
  version = c(VERSION, VERSION1),
  creation_act = c(CREATION_ACT, CREATION_ACT1),
  token = c(TOKEN, TOKEN1))
p_1a <- income_sf %>%
ggplot() +
geom_histogram(aes(x = S1903_C03_001E),
fill = "#4267B2",
color = "black",
bins = 30) +
labs(x = "Income",
y = "N ZipCodes",
title = "Income Distribution of Zip Codes in CT") +
theme_classic2()
p_1b <- total_people_ct_complete %>%
ggplot() +
geom_histogram(aes(x = estimate_dau),
fill = "#4267B2",
color = "black",
bins = 30) +
# scale_x_log10() +
labs(x = "Daily Active Users",
y = "N ZipCodes",
title = "Daily Active User Distribution of Zip Codes in CT") +
theme_classic2()
ggarrange(p_1a, p_1b, nrow = 1, widths = c(0.45, 0.55))
To clean the data for analyzing the relationship between iOS/Android devices and income, we filter our datasets so that we only consider zip codes with over 2,000 total Facebook DAU and over 500 iOS/Android DAU.
ct_income_df$ZipCode <- as.integer(ct_income_df$ZipCode)
fb_df$ZipCode <- as.integer(fb_df$ZipCode)
ios_data <- fb_df %>%
filter(behaviors == 6004384041172) %>%
inner_join(ct_income_df, by = "ZipCode") %>%
inner_join(total_people_ct_complete, by = "ZipCode") %>%
filter(estimate_dau.y > 2000) %>%
filter(estimate_dau.x > 500) %>%
mutate(percent_ios = estimate_dau.x/estimate_dau.y) %>%
filter(percent_ios < 1) %>%
select("ZipCode","percent_ios","S1903_C03_001E")
android_data <- fb_df %>%
filter(behaviors == 6004386044572) %>%
inner_join(ct_income_df, by = "ZipCode") %>%
inner_join(total_people_ct_complete, by = "ZipCode") %>%
filter(estimate_dau.y > 2000) %>%
filter(estimate_dau.x > 500) %>%
filter(estimate_dau.y > estimate_dau.x) %>%
mutate(percent_android = estimate_dau.x/estimate_dau.y) %>%
select("ZipCode","percent_android","S1903_C03_001E")
combined_data <- ios_data %>%
inner_join(android_data, by= "ZipCode")
p_2a <- ios_data %>%
ggplot() +
geom_histogram(aes(x = percent_ios),
fill = "#4267B2",
color = "black",
bins = 30) +
labs(x = "Percent iOS Users",
y = "N ZipCodes",
title = "iOS Distribution of Zip Codes") +
scale_x_continuous(labels = scales::percent) +
theme_classic2()
p_2b <- android_data %>%
ggplot() +
geom_histogram(aes(x = percent_android),
fill = "#4267B2",
color = "black",
bins = 30) +
# scale_x_log10() +
scale_x_continuous(labels = scales::percent) +
labs(x = "Percent Android Users",
y = "N ZipCodes",
title = "Android Distribution of Zip Codes") +
theme_classic2()
p_2c <- combined_data %>%
ggplot() +
geom_histogram(aes(x = 1 - percent_android - percent_ios),
fill = "#4267B2",
color = "black",
bins = 30) +
# scale_x_log10() +
scale_x_continuous(labels = scales::percent) +
labs(x = "Percent Other Users",
y = "N ZipCodes",
title = "Other Distribution of Zip Codes") +
theme_classic2()
ggarrange(p_2a, p_2b, p_2c, nrow = 1, widths = c(0.33, 0.33,0.33))
p_theme <- theme(plot.title = element_text(face = "bold", size = 10),
plot.subtitle = element_text(face = "italic", size = 10),
axis.text = element_text(color = "black"),
axis.title = element_text(size = 10))
ios_plot <- ggplot(data = ios_data) +
geom_smooth(method = "lm", se = TRUE, aes(x = as.numeric(percent_ios), y = as.numeric(S1903_C03_001E)), color = "blue") +
labs(x = "% Users Access through iOS",
y = "Income") +
xlim(0.2, 0.6) +
ylim(0, 220000) +
geom_jitter(mapping = aes(x = as.numeric(percent_ios), y = as.numeric(S1903_C03_001E)), width = 0,
height = 0.1,
pch = 21,
size = 2,
fill = "red") +
labs(x = "% Users Access through iOS",
y = "Income",
title = "B. Income vs. % of Facebook users with iOS") +
theme_classic2() +
p_theme
android_plot <- ggplot(data = android_data) +
geom_smooth(method = "lm", se = TRUE, aes(x = as.numeric(percent_android), y = as.numeric(S1903_C03_001E)), color = "blue") +
xlim(0.2,0.6) +
ylim(0, 220000) +
geom_jitter(mapping = aes(x = as.numeric(percent_android), y = as.numeric(S1903_C03_001E)), width = 0,
height = 0.1,
pch = 21,
size = 2,
fill = "red") +
labs(x = "% Users Access through Android",
y = "Income",
title = "B. Income vs. % of Facebook users with Android") +
theme_classic2() +
p_theme
ggarrange(ios_plot, android_plot, nrow = 1)
This part of the vignette studies the urban-rural divide more generally, comparing interests and behaviors, as defined by Facebook Marketing, between zip codes in Connecticut. Without access to private social media data, such studies would be incredibly difficult and expensive to execute.
First, we must define whether a zip code is urban or rural. We take two approaches. The first creates a binary classifier based on distance from an urban area in Connecticut (i.e., a zip code within a certain distance of a city is considered urban). The second uses the distance from the center of a zip code to the nearest city.
First, we select the cities in Connecticut with populations greater than 100,000 (Waterbury, Stamford, Hartford, New Haven, Bridgeport). We consider any zip code within 5 kilometers of a city to be urban.
us_states_df <- get_fb_parameter_ids(type = "region",
version = VERSION,
token = TOKEN,
country_code = "US")
#> No encoding supplied: defaulting to UTF-8.
ct_key <- us_states_df %>% filter(name == "Connecticut") %>% pull(key)
ct_cities_df <- get_fb_parameter_ids(type = "city",
version = VERSION,
token = TOKEN,
region_id = ct_key,
q = "Connecticut")
#> No encoding supplied: defaulting to UTF-8.
city_keys <- ct_cities_df %>%
filter(type == "city") %>%
filter(name %in% c("Waterbury","New Haven", "Bridgeport","Stamford", "Hartford")) %>%
pull(key)
ct_cities_coords <- get_location_coords(location_unit_type = "city",
location_keys = city_keys,
version = VERSION,
token = TOKEN)
#> No encoding supplied: defaulting to UTF-8.
ct_cities_sf <- st_as_sf(ct_cities_coords, coords = c("longitude", "latitude"), crs = 4326, agr = "constant")
cities_buffered <- st_buffer(ct_cities_sf, dist = 5000)
leaflet() %>%
addTiles() %>%
addCircles(data = ct_cities_sf,
popup = ~name,
radius = 600,
color = "red") %>%
addPolygons(data = cities_buffered,
popup = ~name)
Based on our definition, we can now classify zip codes as urban or rural. We also save the distances between zip code centers and the closest cities for our second definition.
zip_ct_sf_valid <- st_make_valid(zip_ct_sf)
sf_object <- st_simplify(zip_ct_sf_valid, preserveTopology = TRUE)
centroids <- st_centroid(sf_object)
#> Warning: st_centroid assumes attributes are constant over geometries
distances <- st_distance(centroids, ct_cities_sf)
min_distances <- apply(distances, 1, min)
zip_ct_sf <- cbind(zip_ct_sf, min_distances)
intersection <- st_intersection(cities_buffered, zip_ct_sf_valid)
#> Warning: attribute variables are assumed to be spatially constant throughout
#> all geometries
intersecting_zips <- data.frame(name.1 = unique(intersection$name.1))
zip_ct_sf_valid$intersects_city <- FALSE
zip_ct_sf_valid <- zip_ct_sf_valid %>%
mutate(intersects_city = name %in% intersecting_zips$name.1)
leaflet() %>%
addTiles() %>%
addPolygons(data = zip_ct_sf_valid,
popup = ~name,
color = ~ifelse(intersects_city, "red", "blue"),
fillColor = ~ifelse(intersects_city, "red", "blue"),
fillOpacity = 0.5) %>%
addCircles(data = ct_cities_sf,
popup = ~name,
radius = 600,
color = "black") %>%
addPolygons(data = cities_buffered,
popup = ~name,
color = "black")
We select certain interests and behaviors that were also studied in Rama et al. (2020) to compare differences; they are listed below. We use rSocialWatcher to pull the data from the Facebook Marketing API and save it locally.
interests_df <- get_fb_parameter_ids("interests", VERSION, TOKEN)
#> No encoding supplied: defaulting to UTF-8.
behaviors_df <- get_fb_parameter_ids("behaviors", VERSION, TOKEN)
#> No encoding supplied: defaulting to UTF-8.
selected_behaviors <- c("Frequent Travelers",
"Frequent international travelers",
"Technology early adopters")
behaviors_id <- behaviors_df %>%
filter(name %in% selected_behaviors) %>%
pull(id)
selected_interests <- c("Gambling (gambling)",
"Organic food (food & drink)",
"Concerts (music event)",
"Fast food restaurants (dining)",
"Tattoos (body art)",
"Reading (communication)",
"Restaurants (dining)",
"Fitness and wellness (fitness)",
"Cooking (food & drink)",
"Vegetarianism (diets)",
"Jazz music (music)",
"Hunting (sport)",
"Fine art (visual art)",
"Politics (politics)")
interests_id <- interests_df %>%
filter(name %in% selected_interests) %>%
pull(id)
for (behavior in behaviors_id[2:length(behaviors_id)]) {
fb_df <- query_fb_marketing_api(
location_unit_type = "zip",
location_keys = map_param_vec(zip_ct_sf$key[1:length(zip_ct_sf$key)]),
behaviors = behavior,
version = c(VERSION, VERSION1),
creation_act = c(CREATION_ACT, CREATION_ACT1),
token = c(TOKEN, TOKEN1)
)
write.csv(fb_df, file = paste0("data/", behavior, ".csv"))
print(paste("Done with", behavior))
}
for (interest in interests_id[3:4]) {
fb_df <- query_fb_marketing_api(
location_unit_type = "zip",
location_keys = map_param_vec(zip_ct_sf$key[1:length(zip_ct_sf$key)]),
interests = interest,
version = c(VERSION, VERSION1),
creation_act = c(CREATION_ACT, CREATION_ACT1),
token = c(TOKEN, TOKEN1)
)
write.csv(fb_df, file = paste0("data/", interest, ".csv"))
print(paste("Done with", interest))
}
Using the queried data, we perform our analysis. First, we run a regression of each behavior's prevalence on distance from the nearest city; the regression coefficient on distance characterizes the urban/rural difference. Second, we use the binary urban/rural classification of zip codes, measure the difference in means between the two groups, and use a t-test to construct a 95% confidence interval.
The results are below. Note that the two methodologies return relatively similar results.
lr_result_df <- data.frame(Coefficient = numeric(), Lower_CI = numeric(), Upper_CI = numeric(), behavior = character())
mapply(function(x, y) {
data <- read.csv(paste0("data/", y,".csv"))
data$ZipCode <- sub("US:", "", data$location_keys)
zip_ct_sf$ZipCode <- as.integer(sub("US:", "", zip_ct_sf$key))
data$ZipCode <- as.integer(data$ZipCode)
ct_income_df$ZipCode <- as.integer(ct_income_df$ZipCode)
merged_data <- data %>%
inner_join(total_people_ct_complete, by = "ZipCode") %>%
filter(estimate_dau.y > 2000) %>%
filter(estimate_dau.x > 500) %>%
mutate(percent_pop = estimate_dau.x/estimate_dau.y) %>%
inner_join(zip_ct_sf, by = "ZipCode") %>%
select("ZipCode","percent_pop","min_distances")
lm_model <- lm(percent_pop ~ min_distances, data = merged_data)
min_distance_coef <- coef(lm_model)["min_distances"]
conf_intervals <- confint(lm_model)
min_distance_conf_int <- conf_intervals["min_distances", ]
lr_result_df <<- rbind(lr_result_df, data.frame(
Coefficient = min_distance_coef,
Lower_CI = min_distance_conf_int[1],
Upper_CI = min_distance_conf_int[2],
behavior = x
))
print(paste0("Done with ", x[1]))
}, c(selected_behaviors, selected_interests[4:14]), c(behaviors_id, interests_id[4:14]))
mean_result_df <- data.frame(Difference_in_Means = numeric(), Lower_CI = numeric(), Upper_CI = numeric(), behavior = character())
mapply(function(x, y) {
data <- read.csv(paste0("data/", y,".csv"))
data$ZipCode <- sub("US:", "", data$location_keys)
zip_ct_sf_valid$ZipCode <- as.integer(sub("US:", "", zip_ct_sf_valid$key))
data$ZipCode <- as.integer(data$ZipCode)
ct_income_df$ZipCode <- as.integer(ct_income_df$ZipCode)
merged_data <- data %>%
inner_join(total_people_ct_complete, by = "ZipCode") %>%
filter(estimate_dau.y > 2000) %>%
filter(estimate_dau.x > 500) %>%
mutate(percent_pop = estimate_dau.x/estimate_dau.y) %>%
inner_join(zip_ct_sf_valid, by = "ZipCode") %>%
select("ZipCode","percent_pop","intersects_city")
# Grouping data by intersects_city
group1 <- merged_data$percent_pop[merged_data$intersects_city == 0]
group2 <- merged_data$percent_pop[merged_data$intersects_city == 1]
# Performing t-test
t_test_result <- t.test(group1, group2)
# Calculating difference in means and confidence interval manually
diff_means <- mean(group1) - mean(group2)
se <- sqrt(var(group1)/length(group1) + var(group2)/length(group2))
alpha <- 0.05
t_critical <- qt(1 - alpha/2, df = length(group1) + length(group2) - 2)
conf_interval <- c(diff_means - t_critical * se, diff_means + t_critical * se)
mean_result_df <<- rbind(mean_result_df, data.frame(
Difference_in_Means = diff_means,
Lower_CI = conf_interval[1],
Upper_CI = conf_interval[2],
behavior = x
))
#print(paste0("Done with ", x))
}, c(selected_behaviors, selected_interests[4:14]), c(behaviors_id, interests_id[4:14]))
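Note that the manual interval above pairs the Welch standard error with pooled-variance degrees of freedom, whereas `t.test` defaults to the Welch–Satterthwaite degrees of freedom. A sketch on simulated data (the group means and spreads are arbitrary stand-ins for rural and urban shares) shows the two intervals are close but not identical:

```r
set.seed(42)
group1 <- rnorm(40, mean = 0.30, sd = 0.05)  # stand-in for rural shares
group2 <- rnorm(25, mean = 0.25, sd = 0.08)  # stand-in for urban shares

# Welch t-test: the confidence interval comes directly from the test object
welch_ci <- t.test(group1, group2)$conf.int

# Manual interval with Welch SE but pooled df, mirroring the loop above
diff_means <- mean(group1) - mean(group2)
se         <- sqrt(var(group1)/length(group1) + var(group2)/length(group2))
t_crit     <- qt(0.975, df = length(group1) + length(group2) - 2)
manual_ci  <- c(diff_means - t_crit * se, diff_means + t_crit * se)
```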
p_a1 <- ggplot(lr_result_df, aes(x = behavior, y = Coefficient, ymin = Lower_CI, ymax = Upper_CI)) +
geom_pointrange() + # Adds the points for coefficients and vertical lines for confidence intervals
geom_hline(yintercept = 0, linetype = "dashed", color = "red") + # Adds a dashed red line at zero
coord_flip() + # Makes the plot horizontal
theme_minimal() + # Applies a minimalistic theme
labs(title = "Regression Coefficient with City Distance with 95% CI",
x = "Behavior", y = "Coefficient") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
annotate("text", x = min(lr_result_df$behavior), y = max(lr_result_df$Coefficient)*1.1, label = "Less Urban", vjust = 3) +
annotate("text", x = min(lr_result_df$behavior), y = min(lr_result_df$Coefficient)*1.9, label = "More Urban", vjust = 3)
p_a2 <- ggplot(mean_result_df, aes(x = behavior, y = Difference_in_Means, ymin = Lower_CI, ymax = Upper_CI)) +
geom_pointrange() + # Adds the points for coefficients and vertical lines for confidence intervals
geom_hline(yintercept = 0, linetype = "dashed", color = "red") + # Adds a dashed red line at zero
coord_flip() + # Makes the plot horizontal
theme_minimal() + # Applies a minimalistic theme
labs(title = "Difference in Mean with 95% CI",
x = "Behavior", y = "Difference in Means") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(axis.text.y = element_blank()) +
annotate("text", x = min(mean_result_df$behavior), y = -0.02, label = "More Urban", vjust = 3) +
annotate("text", x = min(mean_result_df$behavior), y = 0.06, label = "Less Urban", vjust = 3)
ggarrange(p_a1, p_a2, nrow = 1, widths = c(0.6, 0.4))